ECON 323: Predicting Rice Crop Yield In Vietnam¶

Authors: Divyadarshan Punjabi; Nirvaan Rohira; and Yasin Zahir¶

Outline¶

  • Introduction and Current Literature
  • Setting up the import statements and the themes
  • Loading the Datasets
  • Feature Engineering
    • Lets visualize the data coming from Landsat and Sentinel 1 satellites for a sample location
  • Let us calculate our Features (Predictor Variables)
    • From Landsat
    • From Sentinel 1
  • Now let's start building the model
    • Linear Regression
    • Lasso Regression
    • Feature Extraction
    • ML Regression Models
    • Boosting
  • Bottlenecks and Methods of Improvements
  • Conclusion
  • References
  • Contributions

Introduction and Current Literature ¶

In 2020, between "720 to 811 million people worldwide were suffering from hunger." Unfortunately, even more people (about 2.4 Billion people) are moderately to severely food insecure (UN SDG Indicators). Due to this our group decided to address this by adhereing to the UN Sustainable Development Goals, especially Goal 2: Zero Hunger. We do so by aiming to create a regression model for farmers to predict crop yield using just satellite data and some ground-truth data. Vietnam is a low-income country with limited resources and a major exporter of rice. Hence, we choose to predict rice crop yield in Vietnam.

Currently, the highest accuracy achieved in predicting any crop yield, after decades of research, is 0.86 (Seungtaek Jeong et al. (2021)).

Setting up the import statements and the themes ¶

In [1]:
pip install xgboost catboost
Requirement already satisfied: xgboost in /srv/conda/envs/notebook/lib/python3.10/site-packages (1.7.5)
Requirement already satisfied: catboost in /srv/conda/envs/notebook/lib/python3.10/site-packages (1.1.1)
Requirement already satisfied: scipy in /srv/conda/envs/notebook/lib/python3.10/site-packages (from xgboost) (1.9.1)
Requirement already satisfied: numpy in /srv/conda/envs/notebook/lib/python3.10/site-packages (from xgboost) (1.22.4)
Requirement already satisfied: matplotlib in /srv/conda/envs/notebook/lib/python3.10/site-packages (from catboost) (3.5.3)
Requirement already satisfied: plotly in /srv/conda/envs/notebook/lib/python3.10/site-packages (from catboost) (5.14.1)
Requirement already satisfied: six in /srv/conda/envs/notebook/lib/python3.10/site-packages (from catboost) (1.16.0)
Requirement already satisfied: pandas>=0.24.0 in /srv/conda/envs/notebook/lib/python3.10/site-packages (from catboost) (1.4.4)
Requirement already satisfied: graphviz in /srv/conda/envs/notebook/lib/python3.10/site-packages (from catboost) (0.20.1)
Requirement already satisfied: python-dateutil>=2.8.1 in /srv/conda/envs/notebook/lib/python3.10/site-packages (from pandas>=0.24.0->catboost) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /srv/conda/envs/notebook/lib/python3.10/site-packages (from pandas>=0.24.0->catboost) (2022.2.1)
Requirement already satisfied: cycler>=0.10 in /srv/conda/envs/notebook/lib/python3.10/site-packages (from matplotlib->catboost) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in /srv/conda/envs/notebook/lib/python3.10/site-packages (from matplotlib->catboost) (4.37.2)
Requirement already satisfied: pyparsing>=2.2.1 in /srv/conda/envs/notebook/lib/python3.10/site-packages (from matplotlib->catboost) (3.0.9)
Requirement already satisfied: packaging>=20.0 in /srv/conda/envs/notebook/lib/python3.10/site-packages (from matplotlib->catboost) (21.3)
Requirement already satisfied: kiwisolver>=1.0.1 in /srv/conda/envs/notebook/lib/python3.10/site-packages (from matplotlib->catboost) (1.4.4)
Requirement already satisfied: pillow>=6.2.0 in /srv/conda/envs/notebook/lib/python3.10/site-packages (from matplotlib->catboost) (9.2.0)
Requirement already satisfied: tenacity>=6.2.0 in /srv/conda/envs/notebook/lib/python3.10/site-packages (from plotly->catboost) (8.0.1)
Note: you may need to restart the kernel to use updated packages.
In [2]:
# Supress Warnings
import warnings
warnings.filterwarnings('ignore')

# Visualization
import ipyleaflet
import matplotlib.pyplot as plt
from IPython.display import Image
import seaborn as sns

# Data Science
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from scipy.interpolate import interp1d
from datetime import datetime, timedelta

# Machine Learning
import graphviz
import xgboost as xgb
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, VotingRegressor, AdaBoostRegressor
from sklearn.model_selection import learning_curve
from sklearn.feature_selection import RFECV

# Planetary Computer Tools
import pystac
import pystac_client
import odc
from pystac_client import Client
from pystac.extensions.eo import EOExtension as eo
from odc.stac import stac_load
import planetary_computer as pc

# Import common GIS tools
import pandas as pd
import numpy as np
import xarray as xr
import matplotlib.pyplot as plt
import rasterio.features
import rioxarray as rio
import xrspatial.multispectral as ms
from contextlib import redirect_stdout
import os
import sys

# Pass your API key here
pc.settings.set_subscription_key('71113eb526914deba51505c8b5463b88')

Loading the Dataset ¶

In [3]:
locations = pd.read_csv("Crop_Yield_Data.csv") # Ground Truth Data
weather_data = pd.read_csv('Chau Phu, Vietnam 2021-10-01 to 2022-09-01.csv')
weather_data1 = pd.read_csv('Chau Thanh, Vietnam 2021-10-01 to 2022-09-01.csv')
weather_data2 = pd.read_csv('Thoai Son 2021-10-01 to 2022-09-01.csv')

locations.head()
Out[3]:
District Latitude Longitude Season(SA = Summer Autumn, WS = Winter Spring) Rice Crop Intensity(D=Double, T=Triple) Date of Harvest Field size (ha) Rice Yield (kg/ha)
0 Chau_Phu 10.510542 105.248554 SA T 15-07-2022 3.40 5500
1 Chau_Phu 10.509150 105.265098 SA T 15-07-2022 2.43 6000
2 Chau_Phu 10.467721 105.192464 SA D 15-07-2022 1.95 6400
3 Chau_Phu 10.494453 105.241281 SA T 15-07-2022 4.30 6000
4 Chau_Phu 10.535058 105.252744 SA D 14-07-2022 3.30 6400
In [4]:
locations.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 557 entries, 0 to 556
Data columns (total 8 columns):
 #   Column                                          Non-Null Count  Dtype  
---  ------                                          --------------  -----  
 0   District                                        557 non-null    object 
 1   Latitude                                        557 non-null    float64
 2   Longitude                                       557 non-null    float64
 3   Season(SA = Summer Autumn, WS = Winter Spring)  557 non-null    object 
 4   Rice Crop Intensity(D=Double, T=Triple)         557 non-null    object 
 5   Date of Harvest                                 557 non-null    object 
 6   Field size (ha)                                 557 non-null    float64
 7   Rice Yield (kg/ha)                              557 non-null    int64  
dtypes: float64(3), int64(1), object(4)
memory usage: 34.9+ KB
In [5]:
weather_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 336 entries, 0 to 335
Data columns (total 33 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   name              336 non-null    object 
 1   datetime          336 non-null    object 
 2   tempmax           336 non-null    float64
 3   tempmin           336 non-null    float64
 4   temp              336 non-null    float64
 5   feelslikemax      336 non-null    float64
 6   feelslikemin      336 non-null    float64
 7   feelslike         336 non-null    float64
 8   dew               336 non-null    float64
 9   humidity          336 non-null    float64
 10  precip            336 non-null    float64
 11  precipprob        336 non-null    int64  
 12  precipcover       336 non-null    float64
 13  preciptype        274 non-null    object 
 14  snow              235 non-null    float64
 15  snowdepth         235 non-null    float64
 16  windgust          238 non-null    float64
 17  windspeed         336 non-null    float64
 18  winddir           336 non-null    float64
 19  sealevelpressure  336 non-null    float64
 20  cloudcover        336 non-null    float64
 21  visibility        336 non-null    float64
 22  solarradiation    336 non-null    float64
 23  solarenergy       336 non-null    float64
 24  uvindex           336 non-null    int64  
 25  severerisk        235 non-null    float64
 26  sunrise           336 non-null    object 
 27  sunset            336 non-null    object 
 28  moonphase         336 non-null    float64
 29  conditions        336 non-null    object 
 30  description       336 non-null    object 
 31  icon              336 non-null    object 
 32  stations          336 non-null    object 
dtypes: float64(22), int64(2), object(9)
memory usage: 86.8+ KB

Feature Engineering ¶

Rice crop phenology examines growth stages and timings in rice plants. Phenology curves track development parameters like leaf area index, biomass, and vegetation indices. These curves aid in monitoring crops, predicting yields, and planning agricultural practices in Vietnam. By combining weather data with phenology curves, we can use growth stages as predictors in a data-driven ML model.

Lets visualize the data coming from Landsat and Sentinel 1 satellites for a sample location ¶

First, we define our area of interest using the centroid's latitude and longitude coordinates. We then establish the size of the surrounding bounding box (in degrees) and set a time window for a typical rice growing season.

Bounding boxes are rectangular regions outlining areas of interest in remote sensing imagery, represented by (xmin, ymin, xmax, ymax) coordinates. Landsat's 30-meter and Sentinel-1's 10-meter resolutions balance data volume, swath widths, revisit times, and applications like land coverage and agriculture.

In [6]:
# Sample Rice Crop Field in An Giang, Vietnam
lat_long = (10.4391, 105.3338) # Lat-Lon centroid location
box_size_deg1 = 0.10 # Surrounding box in degrees

# Calculate the Lat-Lon bounding box region
min_lon = lat_long[1]-box_size_deg1/2
min_lat = lat_long[0]-box_size_deg1/2
max_lon = lat_long[1]+box_size_deg1/2
max_lat = lat_long[0]+box_size_deg1/2
bounds = (min_lon, min_lat, max_lon, max_lat)

# Define the time window
time_window="2021-12-01/2022-04-30"

# Define the pixel resolution for the final product
# Define the scale according to our selected crs, so we will use degrees
resolution1 = 30  # meters per pixel 
scale1 = resolution1 / 111320.0 # degrees per pixel for CRS:4326 

For Landsat¶

Using the pystac_client we can search the Planetary Computer's STAC catalog for items matching our query parameters. The result is the number of scenes matching our search criteria that touch our area of interest. Some of these may be partial scenes and may contain clouds.

In [7]:
stac = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
search = stac.search(
    collections=["landsat-c2-l2"], 
    bbox=bounds, 
    datetime=time_window,
    query={"platform": {"in": ["landsat-8", "landsat-9"]},},
)
items = list(search.get_all_items())
print('This is the number of scenes that touch our region:',len(items))
This is the number of scenes that touch our region: 31
In [8]:
xx = stac_load(
    items,
    bands=["red", "green", "blue", "nir08", "qa_pixel"],
    crs="EPSG:4326", # Latitude-Longitude
    resolution=scale1, # Degrees
    chunks={"x": 2048, "y": 2048},
    patch_url=pc.sign,
    bbox=bounds
)

# Apply scaling and offsets for Landsat Collection-2 (reference below) to the spectral bands ONLY
# https://planetarycomputer.microsoft.com/dataset/landsat-c2-l2
xx['red'] = (xx['red']*0.0000275)-0.2
xx['green'] = (xx['green']*0.0000275)-0.2
xx['blue'] = (xx['blue']*0.0000275)-0.2
xx['nir08'] = (xx['nir08']*0.0000275)-0.2

# View the dimensions of our XARRAY and the variables
display(xx)
<xarray.Dataset>
Dimensions:      (latitude: 372, longitude: 372, time: 31)
Coordinates:
  * latitude     (latitude) float64 10.49 10.49 10.49 ... 10.39 10.39 10.39
  * longitude    (longitude) float64 105.3 105.3 105.3 ... 105.4 105.4 105.4
    spatial_ref  int32 4326
  * time         (time) datetime64[ns] 2021-12-01T03:20:47.203656 ... 2022-04...
Data variables:
    red          (time, latitude, longitude) float64 dask.array<chunksize=(1, 372, 372), meta=np.ndarray>
    green        (time, latitude, longitude) float64 dask.array<chunksize=(1, 372, 372), meta=np.ndarray>
    blue         (time, latitude, longitude) float64 dask.array<chunksize=(1, 372, 372), meta=np.ndarray>
    nir08        (time, latitude, longitude) float64 dask.array<chunksize=(1, 372, 372), meta=np.ndarray>
    qa_pixel     (time, latitude, longitude) uint16 dask.array<chunksize=(1, 372, 372), meta=np.ndarray>
xarray.Dataset
    • latitude: 372
    • longitude: 372
    • time: 31
    • latitude
      (latitude)
      float64
      10.49 10.49 10.49 ... 10.39 10.39
      units :
      degrees_north
      resolution :
      -0.00026949335249730504
      crs :
      EPSG:4326
      array([10.489086, 10.488816, 10.488547, ..., 10.389642, 10.389373, 10.389103])
    • longitude
      (longitude)
      float64
      105.3 105.3 105.3 ... 105.4 105.4
      units :
      degrees_east
      resolution :
      0.00026949335249730504
      crs :
      EPSG:4326
      array([105.283911, 105.284181, 105.28445 , ..., 105.383354, 105.383624,
             105.383893])
    • spatial_ref
      ()
      int32
      4326
      spatial_ref :
      GEOGCRS["WGS 84",ENSEMBLE["World Geodetic System 1984 ensemble",MEMBER["World Geodetic System 1984 (Transit)"],MEMBER["World Geodetic System 1984 (G730)"],MEMBER["World Geodetic System 1984 (G873)"],MEMBER["World Geodetic System 1984 (G1150)"],MEMBER["World Geodetic System 1984 (G1674)"],MEMBER["World Geodetic System 1984 (G1762)"],MEMBER["World Geodetic System 1984 (G2139)"],ELLIPSOID["WGS 84",6378137,298.257223563,LENGTHUNIT["metre",1]],ENSEMBLEACCURACY[2.0]],PRIMEM["Greenwich",0,ANGLEUNIT["degree",0.0174532925199433]],CS[ellipsoidal,2],AXIS["geodetic latitude (Lat)",north,ORDER[1],ANGLEUNIT["degree",0.0174532925199433]],AXIS["geodetic longitude (Lon)",east,ORDER[2],ANGLEUNIT["degree",0.0174532925199433]],USAGE[SCOPE["Horizontal component of 3D system."],AREA["World."],BBOX[-90,-180,90,180]],ID["EPSG",4326]]
      crs_wkt :
      GEOGCRS["WGS 84",ENSEMBLE["World Geodetic System 1984 ensemble",MEMBER["World Geodetic System 1984 (Transit)"],MEMBER["World Geodetic System 1984 (G730)"],MEMBER["World Geodetic System 1984 (G873)"],MEMBER["World Geodetic System 1984 (G1150)"],MEMBER["World Geodetic System 1984 (G1674)"],MEMBER["World Geodetic System 1984 (G1762)"],MEMBER["World Geodetic System 1984 (G2139)"],ELLIPSOID["WGS 84",6378137,298.257223563,LENGTHUNIT["metre",1]],ENSEMBLEACCURACY[2.0]],PRIMEM["Greenwich",0,ANGLEUNIT["degree",0.0174532925199433]],CS[ellipsoidal,2],AXIS["geodetic latitude (Lat)",north,ORDER[1],ANGLEUNIT["degree",0.0174532925199433]],AXIS["geodetic longitude (Lon)",east,ORDER[2],ANGLEUNIT["degree",0.0174532925199433]],USAGE[SCOPE["Horizontal component of 3D system."],AREA["World."],BBOX[-90,-180,90,180]],ID["EPSG",4326]]
      semi_major_axis :
      6378137.0
      semi_minor_axis :
      6356752.314245179
      inverse_flattening :
      298.257223563
      reference_ellipsoid_name :
      WGS 84
      longitude_of_prime_meridian :
      0.0
      prime_meridian_name :
      Greenwich
      geographic_crs_name :
      WGS 84
      grid_mapping_name :
      latitude_longitude
      array(4326, dtype=int32)
    • time
      (time)
      datetime64[ns]
      2021-12-01T03:20:47.203656 ... 2...
      array(['2021-12-01T03:20:47.203656000', '2021-12-09T03:20:46.977108000',
             '2021-12-10T03:14:36.030555000', '2021-12-18T03:14:31.113564000',
             '2021-12-25T03:20:42.957971000', '2021-12-26T03:14:31.822391000',
             '2022-01-02T03:20:40.283448000', '2022-01-03T03:14:31.372292000',
             '2022-01-10T03:20:40.399282000', '2022-01-11T03:14:28.733155000',
             '2022-01-18T03:20:37.922626000', '2022-01-19T03:14:30.003400000',
             '2022-01-26T03:20:41.842180000', '2022-01-27T03:14:23.128333000',
             '2022-02-03T03:20:33.610922000', '2022-02-04T03:14:30.653162000',
             '2022-02-11T03:20:39.348336000', '2022-02-12T03:14:20.499449000',
             '2022-02-19T03:20:27.771564000', '2022-02-20T03:14:23.439714000',
             '2022-02-27T03:20:28.774876000', '2022-03-07T03:20:25.278657000',
             '2022-03-16T03:14:10.125549000', '2022-03-23T03:20:15.728245000',
             '2022-03-24T03:14:12.685366000', '2022-04-01T03:13:58.770662000',
             '2022-04-08T03:20:13.150736000', '2022-04-09T03:14:09.306474000',
             '2022-04-16T03:20:15.559889000', '2022-04-24T03:20:13.200949000',
             '2022-04-25T03:13:59.967091000'], dtype='datetime64[ns]')
    • red
      (time, latitude, longitude)
      float64
      dask.array<chunksize=(1, 372, 372), meta=np.ndarray>
      Array Chunk
      Bytes 32.73 MiB 1.06 MiB
      Shape (31, 372, 372) (1, 372, 372)
      Count 3 Graph Layers 31 Chunks
      Type float64 numpy.ndarray
      372 372 31
    • green
      (time, latitude, longitude)
      float64
      dask.array<chunksize=(1, 372, 372), meta=np.ndarray>
      Array Chunk
      Bytes 32.73 MiB 1.06 MiB
      Shape (31, 372, 372) (1, 372, 372)
      Count 3 Graph Layers 31 Chunks
      Type float64 numpy.ndarray
      372 372 31
    • blue
      (time, latitude, longitude)
      float64
      dask.array<chunksize=(1, 372, 372), meta=np.ndarray>
      Array Chunk
      Bytes 32.73 MiB 1.06 MiB
      Shape (31, 372, 372) (1, 372, 372)
      Count 3 Graph Layers 31 Chunks
      Type float64 numpy.ndarray
      372 372 31
    • nir08
      (time, latitude, longitude)
      float64
      dask.array<chunksize=(1, 372, 372), meta=np.ndarray>
      Array Chunk
      Bytes 32.73 MiB 1.06 MiB
      Shape (31, 372, 372) (1, 372, 372)
      Count 3 Graph Layers 31 Chunks
      Type float64 numpy.ndarray
      372 372 31
    • qa_pixel
      (time, latitude, longitude)
      uint16
      dask.array<chunksize=(1, 372, 372), meta=np.ndarray>
      nodata :
      1
      Array Chunk
      Bytes 8.18 MiB 270.28 kiB
      Shape (31, 372, 372) (1, 372, 372)
      Count 1 Graph Layer 31 Chunks
      Type uint16 numpy.ndarray
      372 372 31

View RGB (real color) images from the time series¶

Data is available for Landsat-8 from April 2013 onwards and for Landsat-9 from February 2022 onwards. Typically, our region is viewed every 8 days, with additional scenes due to overlaps. Over a 5-month example, 31 time slices touch our region, but only 7 are very clear, with several others being partially cloudy.

In [9]:
plot_xx = xx[["red","green","blue"]].to_array()
plot = plot_xx.plot.imshow(col='time', col_wrap=4, robust=True, vmin=0, vmax=0.3)
plot.fig.suptitle("Fig. 1. Regions Captured from Landsat", fontsize=15, fontweight='bold')
plot.fig.subplots_adjust(top=0.96)
plt.show()
In [10]:
# Select a time slice to view a simple RGB image and the cloud mask
# See the XARRAY dimensions above for the number of time slices (starts at 0)

# Slice #6 - Mostly Clear
# Slice #24 - Scattered Clouds
# Slice #7 - Very Cloudy

time_slice = 24
In [11]:
# Plot and RGB Real Color Image for a single date
fig, ax = plt.subplots(figsize=(7, 5))
xx.isel(time=time_slice)[["red", "green", "blue"]].to_array().plot.imshow(robust=True, ax=ax, vmin=0, vmax=0.3)
ax.set_title("Fig. 2. RGB Real Color", fontweight='bold')
ax.axis('off')
plt.show()

Applying Cloud Filtering and Masking¶

Cloud masking for Landsat Collection-2 Level-2 data involves using the "qa_pixel" band to classify pixels as cloud, water, or cloud shadow. Creating a mask for these categories allows us to extract clear pixels for assessing vegetation state.

In [12]:
# To mask the pixels and find clouds or water, it is best to use the bit values of the 16-bit qa_pixel flag
# See the website above for a nice explanation of the process

bit_flags = {
            'fill': 1<<0,
            'dilated_cloud': 1<<1,
            'cirrus': 1<<2, 
            'cloud': 1<<3,
            'shadow': 1<<4, 
            'snow': 1<<5, 
            'clear': 1<<6,
            'water': 1<<7
}
In [13]:
# Create a function that will mask pixels with a given type

def get_mask(mask, flags_list):
    
    # Create the result mask filled with zeros and the same shape as the mask
    final_mask = np.zeros_like(mask)
    
    # Loop through the flags  
    for flag in flags_list:
        
        # get the mask for each flag
        flag_mask = np.bitwise_and(mask, bit_flags[flag])
        
        # add it to the final flag
        final_mask = final_mask | flag_mask
    
    return final_mask > 0
In [14]:
# Pick a single time slice to view a mask with clouds and water
sample_xx = xx.isel(time=time_slice)
In [15]:
# Find the pixels that are no data (fill), clouds, cloud shadows, or water
my_mask = get_mask(sample_xx['qa_pixel'],
                   ['fill', 'dilated_cloud', 'cirrus', 
                    'cloud', 'shadow', 'water'])
In [16]:
# Show only the mask (Yellow) with valid data in Purple
plt.figure(figsize=(7,5))
plt.imshow(my_mask)
plt.title("Fig. 3. Cloud / Shadows / Water Mask > YELLOW", fontweight='bold')
plt.axis('off')
plt.show()
In [17]:
# Create an RGB function that will display the mask over the background RGB image

def plot_masked_rgb(red, green, blue, mask, color_mask=(1, 0, 0), transparency=0.5, brightness=2):
    
    # to improve our visualization, we will increase the brightness of our values
    red = red / red.max() * brightness
    green = green / green.max() * brightness
    blue = blue / blue.max() * brightness
    
    red = np.where(mask==True, red*transparency+color_mask[0]*(1-transparency), red)
    green = np.where(mask==True, green*transparency+color_mask[1]*(1-transparency), green)
    blue = np.where(mask==True, blue*transparency+color_mask[2]*(1-transparency), blue)
    
    rgb = np.stack([red, green, blue], axis=2)
    
    return rgb

rgb = plot_masked_rgb(sample_xx['red'], sample_xx['green'], sample_xx['blue'], my_mask, color_mask=(1, 0, 1), transparency=0.2, brightness=3)
In [18]:
# This is a nice image that shows the clouds and water pixels (Purple) among clear land pixels
plt.figure(figsize=(7,5))
plt.imshow(rgb)
plt.title("Fig. 4. Cloud / Shadows / Water Mask > MAGENTA", fontweight='bold')
plt.axis('off')
plt.show()
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).

After applying cloud masking and filtering to our region, we'll calculate the Normalized Difference Vegetation Index (NDVI) for both filtered and unfiltered pixels to compare accuracy. We will eventually calculate NDVI for all the locations in our dataset.

NDVI measures vegetation "greenness" with a range of 0.0 to 1.0. Low values (0.0-0.25) indicate little vegetation, middle values (0.25-0.6) represent growing crops, and high values (0.6-1.0) signify peak vegetation. The equation is: NDVI = (NIR-Red) / (NIR+Red).

The NDVI plot shows unfiltered (BLUE) and filtered (GREEN) pixels. Gaps in filtered data signify unavailable clean pixels for NDVI calculation. Similar values suggest low cloud volume, while large differences indicate high cloud volume affecting NDVI. Filtering clouds before NDVI calculation improves phenology accuracy for agricultural plots.

In [19]:
# Calculate the mask for the entire xarray (all time slices)
full_mask = get_mask(xx['qa_pixel'], ['fill', 'dilated_cloud', 'cirrus', 'cloud', 'shadow', 'water'])

# Create a "clean" dataset with the mask applied 
cleaned_data = xx.where(~full_mask)

# Calculate the mean of the data across the sample region and then NDVI
# Perform this calculation for the unfiltered and cloud-filtered (clean) datasets
mean_unfiltered = xx.mean(dim=['longitude','latitude']).compute()
ndvi_mean = (mean_unfiltered.nir08-mean_unfiltered.red)/(mean_unfiltered.nir08+mean_unfiltered.red)
mean_clean = cleaned_data.mean(dim=['longitude','latitude']).compute()
ndvi_mean_clean = (mean_clean.nir08-mean_clean.red)/(mean_clean.nir08+mean_clean.red)
In [20]:
fig = plt.figure(figsize=(7, 5))
ndvi_mean_clean.plot(color='green',marker='o',markersize=4,linewidth=2, label="Filtered = Clouds and Water Removed")
ndvi_mean.plot(color='blue',marker='o',markersize=4,linewidth=2, label="All Pixels = Clouds and Water Included")
plt.title("Fig. 5. Mean NDVI (Vegetation Index)", fontweight='bold')
plt.xlabel("Date")
plt.ylabel("NDVI")
plt.ylim(-0.1,1.0)
plt.legend(loc="upper right", markerscale=2., scatterpoints=1, fontsize=10)
plt.show()
In [21]:
# Plot an NDVI image for a single date with few clouds
# We will select image (index=6) on January 2, 2022. Notice how the water is masked out.

fig = plt.figure(figsize=(7, 5))
ndvi_image = (cleaned_data.nir08-cleaned_data.red)/(cleaned_data.nir08+cleaned_data.red)
ndvi_image.isel(time=6).plot(vmin=0.0, vmax=0.8, cmap="RdYlGn")
plt.title("Fig. 6. NDVI", fontweight='bold')
plt.axis('off')
plt.show()

For Sentinel 1¶

In [22]:
box_size_deg2 = 0.0004 # Surrounding box in degrees, yields approximately 5x5 pixel region
resolution2 = 10  # meters per pixel 
scale2 = resolution2 / 111320.0 # degrees per pixel for crs=4326 

catalog = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
search = catalog.search(collections=["sentinel-1-rtc"], bbox=bounds, datetime=time_window)
items = list(search.get_all_items())
print('This is the number of scenes that touch our region:',len(items))
This is the number of scenes that touch our region: 27
In [23]:
# Load the data using Open Data Cube
data = stac_load(items,bands=["vv", "vh"], patch_url=pc.sign, bbox=bounds, crs="EPSG:4326", resolution=scale2)

# View the details of our xarray dataset
# The X and Y dimensions tell us the pixel dimensions of our bounding box
# The "time" variable is the number of scenes that touch our region
data
Out[23]:
<xarray.Dataset>
Dimensions:      (latitude: 1114, longitude: 1114, time: 27)
Coordinates:
  * latitude     (latitude) float64 10.49 10.49 10.49 ... 10.39 10.39 10.39
  * longitude    (longitude) float64 105.3 105.3 105.3 ... 105.4 105.4 105.4
    spatial_ref  int32 4326
  * time         (time) datetime64[ns] 2021-12-04T22:46:07.919581 ... 2022-04...
Data variables:
    vv           (time, latitude, longitude) float32 0.02417 0.01493 ... 0.272
    vh           (time, latitude, longitude) float32 0.004146 ... 0.01252
xarray.Dataset
    • latitude: 1114
    • longitude: 1114
    • time: 27
    • latitude
      (latitude)
      float64
      10.49 10.49 10.49 ... 10.39 10.39
      units :
      degrees_north
      resolution :
      -8.983111749910169e-05
      crs :
      EPSG:4326
      array([10.489086, 10.488996, 10.488906, ..., 10.389283, 10.389193, 10.389103])
    • longitude
      (longitude)
      float64
      105.3 105.3 105.3 ... 105.4 105.4
      units :
      degrees_east
      resolution :
      8.983111749910169e-05
      crs :
      EPSG:4326
      array([105.283821, 105.283911, 105.284001, ..., 105.383624, 105.383714,
             105.383803])
    • spatial_ref
      ()
      int32
      4326
      spatial_ref :
      GEOGCRS["WGS 84",ENSEMBLE["World Geodetic System 1984 ensemble",MEMBER["World Geodetic System 1984 (Transit)"],MEMBER["World Geodetic System 1984 (G730)"],MEMBER["World Geodetic System 1984 (G873)"],MEMBER["World Geodetic System 1984 (G1150)"],MEMBER["World Geodetic System 1984 (G1674)"],MEMBER["World Geodetic System 1984 (G1762)"],MEMBER["World Geodetic System 1984 (G2139)"],ELLIPSOID["WGS 84",6378137,298.257223563,LENGTHUNIT["metre",1]],ENSEMBLEACCURACY[2.0]],PRIMEM["Greenwich",0,ANGLEUNIT["degree",0.0174532925199433]],CS[ellipsoidal,2],AXIS["geodetic latitude (Lat)",north,ORDER[1],ANGLEUNIT["degree",0.0174532925199433]],AXIS["geodetic longitude (Lon)",east,ORDER[2],ANGLEUNIT["degree",0.0174532925199433]],USAGE[SCOPE["Horizontal component of 3D system."],AREA["World."],BBOX[-90,-180,90,180]],ID["EPSG",4326]]
      crs_wkt :
      GEOGCRS["WGS 84",ENSEMBLE["World Geodetic System 1984 ensemble",MEMBER["World Geodetic System 1984 (Transit)"],MEMBER["World Geodetic System 1984 (G730)"],MEMBER["World Geodetic System 1984 (G873)"],MEMBER["World Geodetic System 1984 (G1150)"],MEMBER["World Geodetic System 1984 (G1674)"],MEMBER["World Geodetic System 1984 (G1762)"],MEMBER["World Geodetic System 1984 (G2139)"],ELLIPSOID["WGS 84",6378137,298.257223563,LENGTHUNIT["metre",1]],ENSEMBLEACCURACY[2.0]],PRIMEM["Greenwich",0,ANGLEUNIT["degree",0.0174532925199433]],CS[ellipsoidal,2],AXIS["geodetic latitude (Lat)",north,ORDER[1],ANGLEUNIT["degree",0.0174532925199433]],AXIS["geodetic longitude (Lon)",east,ORDER[2],ANGLEUNIT["degree",0.0174532925199433]],USAGE[SCOPE["Horizontal component of 3D system."],AREA["World."],BBOX[-90,-180,90,180]],ID["EPSG",4326]]
      semi_major_axis :
      6378137.0
      semi_minor_axis :
      6356752.314245179
      inverse_flattening :
      298.257223563
      reference_ellipsoid_name :
      WGS 84
      longitude_of_prime_meridian :
      0.0
      prime_meridian_name :
      Greenwich
      geographic_crs_name :
      WGS 84
      grid_mapping_name :
      latitude_longitude
      array(4326, dtype=int32)
    • time
      (time)
      datetime64[ns]
      2021-12-04T22:46:07.919581 ... 2...
      array(['2021-12-04T22:46:07.919581000', '2021-12-05T11:11:54.609425000',
             '2021-12-10T22:45:35.316746000', '2021-12-16T22:46:07.511215000',
             '2021-12-17T11:11:54.196354000', '2021-12-22T22:45:34.433936000',
             '2021-12-28T22:46:06.761481000', '2021-12-29T11:11:53.438580000',
             '2022-01-09T22:46:06.347730000', '2022-01-10T11:11:53.057355000',
             '2022-01-21T22:46:05.657153000', '2022-01-22T11:11:52.377922000',
             '2022-02-02T22:46:04.929540000', '2022-02-03T11:11:51.699083000',
             '2022-02-14T22:46:05.071585000', '2022-02-15T11:11:51.767747000',
             '2022-02-26T22:46:04.969244000', '2022-03-10T22:46:04.925705000',
             '2022-03-11T11:11:51.691272000', '2022-03-22T22:46:05.287764000',
             '2022-03-23T11:11:52.064532000', '2022-04-03T22:46:05.477503000',
             '2022-04-04T11:11:52.234391000', '2022-04-15T22:46:05.644861000',
             '2022-04-16T11:11:52.470632000', '2022-04-27T22:46:06.296981000',
             '2022-04-28T11:11:53.135539000'], dtype='datetime64[ns]')
    • vv
      (time, latitude, longitude)
      float32
      0.02417 0.01493 ... 0.3726 0.272
      nodata :
      -32768
      array([[[0.02417101, 0.01493058, 0.01866966, ..., 0.1351117 ,
               0.13312186, 0.07470179],
              [0.01556157, 0.01662732, 0.01762288, ..., 0.09577081,
               0.10823271, 0.08096705],
              [0.01048668, 0.0134217 , 0.0117636 , ..., 0.10771083,
               0.08537243, 0.08583598],
              ...,
              [0.07889048, 0.05197918, 0.09618551, ..., 0.01098593,
               0.01214612, 0.01412993],
              [0.06246216, 0.04127437, 0.07787105, ..., 0.01052941,
               0.00976794, 0.00828947],
              [0.12281648, 0.0785541 , 0.07129239, ..., 0.00410046,
               0.00807947, 0.00576253]],
      
             [[0.01406382, 0.0143489 , 0.01710286, ..., 0.09546202,
               0.05939918, 0.07858594],
              [0.02011441, 0.02592376, 0.02726987, ..., 0.19192752,
               0.06553745, 0.07993296],
              [0.01974167, 0.02631255, 0.02614574, ..., 0.16457804,
               0.04583229, 0.03341462],
      ...
              [0.2212939 , 0.16180752, 0.12709288, ..., 0.29225406,
               0.24925222, 0.26113272],
              [0.31790692, 0.19440225, 0.08606878, ..., 0.31218913,
               0.51343554, 0.48167843],
              [0.43326014, 0.32252944, 0.16411823, ..., 0.45604324,
               0.58543694, 0.48651794]],
      
             [[0.16879988, 0.22080727, 0.45312023, ..., 0.06394066,
               0.0531206 , 0.06490773],
              [0.17604157, 0.25631052, 0.43659902, ..., 0.1027702 ,
               0.07629566, 0.09150371],
              [0.14171255, 0.26319695, 0.4560728 , ..., 0.11853301,
               0.11126872, 0.2265447 ],
              ...,
              [0.15140437, 0.2989662 , 0.37947145, ..., 0.1524738 ,
               0.20052004, 0.17682123],
              [0.17726156, 0.2191192 , 0.28943285, ..., 0.19817811,
               0.2196678 , 0.21833059],
              [0.24807045, 0.21953791, 0.3113074 , ..., 0.37869424,
               0.3726446 , 0.2720224 ]]], dtype=float32)
    • vh
      (time, latitude, longitude)
      float32
      0.004146 0.003489 ... 0.015 0.01252
      nodata :
      -32768
      array([[[0.00414605, 0.00348855, 0.00265869, ..., 0.07254129,
               0.1134388 , 0.12465546],
              [0.00494762, 0.00205696, 0.00172492, ..., 0.06132611,
               0.07009073, 0.06592007],
              [0.00537293, 0.00438383, 0.00350324, ..., 0.04859182,
               0.03122397, 0.02754842],
              ...,
              [0.01847825, 0.02475439, 0.02264003, ..., 0.00810464,
               0.00908654, 0.00665346],
              [0.01511999, 0.03052613, 0.03599842, ..., 0.00781537,
               0.00840591, 0.00574743],
              [0.01947351, 0.02331621, 0.03448671, ..., 0.00533757,
               0.00448371, 0.00601815]],
      
             [[0.00823141, 0.00515232, 0.00490454, ..., 0.04391994,
               0.0363352 , 0.0267037 ],
              [0.00449162, 0.00264303, 0.00315097, ..., 0.04883291,
               0.04464468, 0.04391601],
              [0.00405031, 0.00365794, 0.00366866, ..., 0.05824593,
               0.06907108, 0.07108819],
      ...
              [0.01494024, 0.01399606, 0.03577208, ..., 0.01380739,
               0.01006532, 0.01352517],
              [0.01282636, 0.01618785, 0.0435005 , ..., 0.03398733,
               0.03114486, 0.03133415],
              [0.03378278, 0.01576377, 0.02611222, ..., 0.04463165,
               0.03879599, 0.0292155 ]],
      
             [[0.02383078, 0.01609869, 0.02356882, ..., 0.03312046,
               0.01375357, 0.0185879 ],
              [0.02813808, 0.02384822, 0.03154745, ..., 0.02769672,
               0.01905811, 0.0353126 ],
              [0.02312637, 0.02367629, 0.02685913, ..., 0.02781266,
               0.03327138, 0.03130198],
              ...,
              [0.03032953, 0.04036504, 0.05917789, ..., 0.01958033,
               0.02263096, 0.01713172],
              [0.05123728, 0.07310684, 0.10459596, ..., 0.02186244,
               0.02186274, 0.01526174],
              [0.06402529, 0.09814875, 0.14598268, ..., 0.01328868,
               0.01499531, 0.01251877]]], dtype=float32)

Sentinel-1 SAR data uses VV and VH polarizations to capture radar backscatter from the Earth's surface, creating grayscale images that depict low backscatter as dark and high backscatter as light (Xu (2021)). This method helps identify surface features like water bodies (dark), vegetation (medium-light), and urban areas (light). The code visualizes VV and VH polarizations of Sentinel-1 SAR data as grayscale images for a specific time slice, aiding in the understanding of backscatter distribution and surface characteristics.

In [24]:
# Assuming that the loaded data is in 'data' variable, let's visualize the vv and vh bands
vv_band = data.vv
vh_band = data.vh

# You can select a specific time slice to visualize, e.g., the first one:
time_index = 0
vv_slice = vv_band.isel(time=time_index)
vh_slice = vh_band.isel(time=time_index)

# Plot vv polarization
plt.figure(figsize=(7, 5))
vv_slice.plot(cmap='gray', robust=True)
plt.title('Fig. 7. Sentinel-1 VV Polarization', fontweight='bold')
plt.show()

# Plot vh polarization
plt.figure(figsize=(7, 5))
vh_slice.plot(cmap='gray', robust=True)
plt.title('Fig. 8. Sentinel-1 VH Polarization', fontweight='bold')
plt.show()

Let us calculate our Features (Predictor Variables) ¶

After visualizing and assessing Landsat and Sentinel 1 data, we'll calculate predictor variables for our prediction model. Using spatial information (latitude, longitude) and harvest dates, we'll derive predictors from both Landsat and Sentinel 1 data, similar to NDVI calculations.

We made an initial selection of predictor variables through robust research; namely Miao (2011), Srivastava (2012), Wu (2015) and more.

From Landsat ¶

We use filtered data, free from clouds and water bodies, to calculate vegetation indices. Indices are computed after applying a masking algorithm on each location in our dataset. Instead of using data for an entire crop cycle (like we did above in our sampling model), we request data for a time window around the harvest date. As Landsat provides data every 8 days, we set a time window for 8 days before and after the harvest date. This approach accounts for accurate cubic-spline interpolation, reduces noise, memory usage, processing power, processing time, and enhances data reliability.

Along with NDVI (spoken above), we chose NDWI, SAVI, EVI2 and Albedo to calculate from Landsat.

NDWI (Normalized Difference Water Index): The index detects water presence in vegetation, useful for identifying waterlogged and flood-affected areas, and monitoring irrigation and water management. Ranging from -1.0 to 1.0, negative values indicate low water content, while increasing values signify higher water content. It's calculated as (NIR - SWIR) / (NIR + SWIR), with NIR representing the near-infrared band and SWIR the shortwave-infrared band.

SAVI (Soil Adjusted Vegetation Index): The index adjusts for soil brightness influence on vegetation indices and is helpful in low vegetation areas with high soil reflectance. SAVI ranges from -1.0 to 1.0, with negative values indicating water, values close to 0 for bare soil/low vegetation, up to 0.25 for sparse vegetation, 0.6 for growing vegetation, and 0.6 to 1.0 for dense vegetation. It's calculated as ((NIR - Red) / (NIR + Red + L)) * (1 + L), with L being a soil adjustment factor (typically between 0 and 1).

EVI 2 (Enhanced Vegetation Index 2): EVI 2, a modification of the original EVI, requires only red and near-infrared bands, making it suitable for sensors like Landsat. Useful for monitoring vegetation health and productivity, it's less sensitive to atmospheric conditions and soil background signals than NDVI. Ranging from -1.0 to 1.0, negative values indicate water, and increasing values represent more vegetation. It's calculated as 2.5 ((NIR - Red) / (NIR + 2.4 Red + 1)).

Albedo: Albedo measures surface reflectivity, with values ranging from 0 (completely absorbing) to 1 (completely reflecting). In vegetation, albedo is crucial for understanding energy balance and heat exchange between land and atmosphere. Higher albedo values in croplands can indicate healthier vegetation, as denser, greener canopies reflect more solar radiation.

By combining these vegetation indices, we gained a comprehensive understanding of the health, productivity, and water management of the croplands, as well as the energy balance and heat exchange between the land surface and the atmosphere.

In [25]:
# Load the data using Open Data Cube for each location
#EVI 2 Coeeficients selected from current literature
G = 2.5
C1 = 6
C2 = 7.5
L = 1

latitudes = []
longitudes = []
albedo_values = []
ndvi_landsat = []
evi2_landsat = []
savi_landsat = []
ndwi_landsat = []

for index, row in locations.iterrows():
    lat = row['Latitude']
    lon = row['Longitude']
    toh = row['Date of Harvest']
    date_object = datetime.strptime(toh, "%d-%m-%Y")
    
    trial_date = date_object.strftime("%Y-%m-%d")
    time_delta = timedelta(days=8)     
    trial_datetime = datetime.strptime(trial_date,"%Y-%m-%d")
    start_date = (trial_datetime - time_delta)
    end_date = (trial_datetime + time_delta)
    time_window3 = f"{start_date}/{end_date}"

    #Load the data using Open Data Cube
    bounds = (lon - box_size_deg1/2, lat - box_size_deg1/2, lon + box_size_deg1/2, lat + box_size_deg1/2)

    #Load the data using Open Data Cube
    catalog = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
    search = catalog.search(collections=["landsat-c2-l2"], bbox=bounds, datetime=time_window3, query={"platform": {"in": ["landsat-8", "landsat-9"]}})
    items = list(search.get_all_items())
    
    xx = stac_load(
    items,
    bands=["red", "green", "blue", "nir08","swir16","qa_pixel"],
    crs="EPSG:4326", 
    resolution=scale1,
    chunks={"x": 2048, "y": 2048},
    patch_url=pc.sign,
    bbox=bounds
)
    
    # Apply scaling and offsets for Landsat Collection-2 (reference below) to the spectral bands ONLY
    # https://planetarycomputer.microsoft.com/dataset/landsat-c2-l2
    xx['red'] = (xx['red']*0.0000275)-0.2
    xx['green'] = (xx['green']*0.0000275)-0.2
    xx['blue'] = (xx['blue']*0.0000275)-0.2
    xx['nir08'] = (xx['nir08']*0.0000275)-0.2
    xx['swir16'] = (xx['swir16']*0.0000275)-0.2
 

    # Calculate the mask for the entire xarray (all time slices)
    full_mask = get_mask(xx['qa_pixel'], ['fill', 'dilated_cloud', 'cirrus', 'cloud', 'shadow', 'water'])

    # Create a "clean" dataset with the mask applied 
    cleaned_data = xx.where(~full_mask)
   
    mean_clean = cleaned_data.mean(dim=['longitude','latitude']).compute()
    ndvi_mean_clean = (mean_clean.nir08-mean_clean.red)/(mean_clean.nir08+mean_clean.red)
    evi2_mean_clean = G * ((mean_clean.nir08 - mean_clean.red) / (mean_clean.nir08 + C1 * mean_clean.red - C2 * mean_clean.blue + L))
    savi_mean_clean = ((mean_clean.nir08 - mean_clean.red) / (mean_clean.nir08 + mean_clean.red + 0.5)) * 1.5
    ndwi_mean_clean = (mean_clean.green - mean_clean.nir08) / (mean_clean.green + mean_clean.nir08)
    albedo = 0.356 * mean_clean.blue + 0.130 * mean_clean.green + 0.373 * mean_clean.red + 0.085 * mean_clean.nir08 + 0.072 * mean_clean.swir16 +0.0018 # constant term
   
	# Appending the latitude, longitude, NDVI, NDWI, EVI 2, SAVI and Albedo values to their respective lists
    latitudes.append(lat)
    longitudes.append(lon)
    albedo_values.append(albedo)
    ndvi_landsat.append(ndvi_mean_clean)
    evi2_landsat.append(evi2_mean_clean)
    savi_landsat.append(savi_mean_clean)
    ndwi_landsat.append(ndwi_mean_clean)
 

We need to make a new for loop specifically to find land surface temperature. This is simply because we utilise the thermal bands offered by Landsat 9 (for it's higher accuracy and more reliable data). We'll tweak the code to specifically get data from Landsat 9, and we'll include the thermal band we are interested in.

In [26]:
# Load the data using Open Data Cube for each location
lst_landsat = []

for index, row in locations.iterrows():
    lat = row['Latitude']
    lon = row['Longitude']
    toh = row['Date of Harvest']
    date_object = datetime.strptime(toh, "%d-%m-%Y")
    
    trial_date = date_object.strftime("%Y-%m-%d")
    time_delta = timedelta(days=8)     # Taking for 8 originally with 8 up and 8 down
    trial_datetime = datetime.strptime(trial_date,"%Y-%m-%d")
    start_date = (trial_datetime - time_delta)
    end_date = (trial_datetime + time_delta)
    time_window3 = f"{start_date}/{end_date}"

    #Load the data using Open Data Cube
    bounds = (lon - box_size_deg1/2, lat - box_size_deg1/2, lon + box_size_deg1/2, lat + box_size_deg1/2)

    try:
        #Load the data using Open Data Cube
        catalog = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
        search = catalog.search(collections=["landsat-c2-l2"], bbox=bounds, datetime=time_window3, query={"platform": {"in": ["landsat-9"]}})
        items = list(search.get_all_items())

        #data = xr.open_rasterio(items, chunks={"x": 2048, "y": 2048}, crs="EPSG:4326", resolution=scale, patch_url=sign)
        xx = stac_load(
        items,
        bands=["red", "green", "blue", "nir08","swir16","qa_pixel", "lwir11"],
        crs="EPSG:4326", # Latitude-Longitude
        resolution=scale1, # Degrees
        chunks={"x": 2048, "y": 2048},
        patch_url=pc.sign,
        bbox=bounds
    )

        # Apply scaling and offsets for Landsat Collection-2 (reference below) to the spectral bands ONLY
        # https://planetarycomputer.microsoft.com/dataset/landsat-c2-l2
        xx['red'] = (xx['red']*0.0000275)-0.2
        xx['green'] = (xx['green']*0.0000275)-0.2
        xx['blue'] = (xx['blue']*0.0000275)-0.2
        xx['nir08'] = (xx['nir08']*0.0000275)-0.2
        xx['swir16'] = (xx['swir16']*0.0000275)-0.2
        xx['lwir11'] = (xx['lwir11'] * 0.0003342) + 0.1

        full_mask = get_mask(xx['qa_pixel'], ['fill', 'dilated_cloud', 'cirrus', 'cloud', 'shadow', 'water'])
        cleaned_data = xx.where(~full_mask)
        mean_clean = cleaned_data.mean(dim=['longitude','latitude']).compute()
        ndvi_mean_clean = (mean_clean.nir08-mean_clean.red)/(mean_clean.nir08+mean_clean.red)

        # Constants adjusted to calculated temperature brightness from Landsat 9
        K1_CONSTANT_BAND_11 = 774.8853    
        K2_CONSTANT_BAND_11 = 1321.0789
        brightness_temperature = K2_CONSTANT_BAND_11 / np.log((K1_CONSTANT_BAND_11 / xx['lwir11']) + 1)

        # Emissivity values
        emissivity_vegetation = 0.99
        emissivity_soil = 0.97

        # Estimate emissivity using the NDVI threshold method
        ndvi_clean = (cleaned_data.nir08 - cleaned_data.red) / (cleaned_data.nir08 + cleaned_data.red)
        emissivity = emissivity_soil + (emissivity_vegetation - emissivity_soil) * np.where(ndvi_clean > 0.5, 1, 0)

        # Calculate LST in Celsius
        lst_celsius = (brightness_temperature / (1 + (0.00115 * brightness_temperature / 1.4388) * np.log(emissivity))) - 273.15
        cleaned_lst = lst_celsius.where(~full_mask)
        mean_lst = cleaned_lst.mean(dim=['longitude', 'latitude']).compute()
        lst_landsat.append(mean_lst)
    
    except:
        print(f"No data found for location {lat}, {lon}")
        continue
 
No data found for location 10.387123, 105.119326
No data found for location 10.426536, 105.115181
No data found for location 10.419552, 105.116409
No data found for location 10.403175, 105.123726
No data found for location 10.399444, 105.118528
No data found for location 10.42324, 105.126544
No data found for location 10.429339, 105.125493
No data found for location 10.435839, 105.132981

We are receiving no data (not just missing values) for some of the locations. To resolve this issue, we'll insert dummy xarray's (with common spatial information) and numpy array with NaN values. Then, we'll simply just interpolate (explained further down) these NaN values.

In [27]:
def create_and_insert_dataarray(index, date_of_harvest, lst_landsat):
    doh_dt = datetime.strptime(date_of_harvest, '%d-%m-%Y')
    ta = xr.DataArray(np.full(2, np.nan), dims='time', coords={'time': [doh_dt.strftime('%Y-%m-%dT03:20:15.057547'), doh_dt.strftime('%Y-%m-%dT03:20:15.057547')]})
    ta.attrs['spatial_ref'] = 4326
    lst_landsat.insert(index, ta)

# Assuming the locations DataFrame and lst_landsat list are defined

indices = [201, 204, 205, 209, 211, 306, 310, 323]

for index in indices:
    date_of_harvest = locations.iloc[index]['Date of Harvest']
    create_and_insert_dataarray(index, date_of_harvest, lst_landsat)

From Sentinel 1 ¶

Similarly, instead of pulling data for an entire crop cycle (like we did above in our sampling model), we only request data for a time window around the date of harvest. Sentinel 1 sends back data after every 6 days and hence, the 12 day time window.

In [28]:
# Utilising this for loamy soil as this is the best generalized soil type for Vietnam
A = 0.71
B = 1.40
sm_feature = []

# Load the data using Open Data Cube for each location
for index, row in locations.iterrows():
    lon = row['Longitude']
    lat = row['Latitude']
    toh = row['Date of Harvest']
    date_object = datetime.strptime(toh, "%d-%m-%Y")
    
    trial_date = date_object.strftime("%Y-%m-%d")
    time_delta = timedelta(days=6)
    trial_datetime = datetime.strptime(trial_date,"%Y-%m-%d")
    start_date = (trial_datetime - time_delta)
    end_date = (trial_datetime + time_delta)
    time_window3 = f"{start_date}/{end_date}"
    
    bbox = (lon - box_size_deg2/2, lat - box_size_deg2/2, lon + box_size_deg2/2, lat + box_size_deg2/2)
    catalog = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
    search = catalog.search(collections=["sentinel-1-rtc"], bbox=bbox, datetime=time_window3)
    items = list(search.get_all_items())
    data = stac_load(items,bands=["vv", "vh"], patch_url=pc.sign, bbox=bbox, crs="EPSG:4326", resolution=scale2)
    mean = data.mean(dim=['latitude','longitude']).compute()
    dop = (mean.vv / (mean.vv + mean.vh))
    
    #Calculating Soil Moisture
    soil_moisture = (1 - ((10 ** (0.1 * dop)) / A) ** B)
    
    sm_feature.append(soil_moisture)
    

Now we check if our data has any missing values or 'NaN' values.

In [29]:
albedo_values[:5]
Out[29]:
[<xarray.DataArray (time: 4)>
 array([nan, nan, nan, nan])
 Coordinates:
     spatial_ref  int32 4326
   * time         (time) datetime64[ns] 2022-07-13T03:20:15.057547 ... 2022-07...,
 <xarray.DataArray (time: 4)>
 array([nan, nan, nan, nan])
 Coordinates:
     spatial_ref  int32 4326
   * time         (time) datetime64[ns] 2022-07-13T03:20:15.057547 ... 2022-07...,
 <xarray.DataArray (time: 2)>
 array([nan, nan])
 Coordinates:
     spatial_ref  int32 4326
   * time         (time) datetime64[ns] 2022-07-13T03:20:38.965531 2022-07-21T...,
 <xarray.DataArray (time: 4)>
 array([nan, nan, nan, nan])
 Coordinates:
     spatial_ref  int32 4326
   * time         (time) datetime64[ns] 2022-07-13T03:20:15.057547 ... 2022-07...,
 <xarray.DataArray (time: 4)>
 array([nan, nan, nan, nan])
 Coordinates:
     spatial_ref  int32 4326
   * time         (time) datetime64[ns] 2022-07-13T03:20:15.057547 ... 2022-07...]

From the satellites, we see that we receive missing data, or in other words a number of missing values. To further make predictions using this data, we need to replace the 'NaN' values with real values. Replacing with zero or mean could be an option but this introduces greater bias in our data. We chose to interpolate the missing values, using cubic-spline, to accurately smoothen the data and replace the data with less bias.

In [30]:
def interpolate_and_create_dataframe(data_list, column_name, latitudes, longitudes):
    values_arrays = [data.values for data in data_list]
    lengths = [len(arr) for arr in values_arrays]
    values_flat = np.concatenate(values_arrays)

    # Get the indices of non-NaN values
    non_nan_indices = np.where(~np.isnan(values_flat))[0]
    non_nan_values = values_flat[non_nan_indices]

    # Create a cubic spline interpolator
    spline_interpolator = interp1d(non_nan_indices, non_nan_values, kind='cubic', bounds_error=False, fill_value=(non_nan_values[0], non_nan_values[-1]))

    # Replace NaN values with interpolated values
    interpolated_data = np.array([spline_interpolator(i) if np.isnan(value) else value for i, value in enumerate(values_flat)])

    # Split the interpolated_data back to original lengths
    interpolated_arrays = np.split(interpolated_data, np.cumsum(lengths)[:-1])

    data = {'Latitude': latitudes, 'Longitude': longitudes, column_name: interpolated_arrays}
    return pd.DataFrame(data)
In [31]:
al_df = interpolate_and_create_dataframe(albedo_values, 'Albedo', latitudes, longitudes)
ndvi_df = interpolate_and_create_dataframe(ndvi_landsat, 'NDVI', latitudes, longitudes)
ndwi_df = interpolate_and_create_dataframe(ndwi_landsat, 'NDWI', latitudes, longitudes)
evi2_df = interpolate_and_create_dataframe(evi2_landsat, 'EVI 2', latitudes, longitudes)
savi_df = interpolate_and_create_dataframe(savi_landsat, 'SAVI', latitudes, longitudes)
lst_df = interpolate_and_create_dataframe(lst_landsat, 'Land Surface Temperature', latitudes, longitudes)
sm_df = interpolate_and_create_dataframe(sm_feature, 'Soil Moisture', latitudes, longitudes)

al_df.head(5)
Out[31]:
Latitude Longitude Albedo
0 10.510542 105.248554 [0.09556685977230825, 0.09556685977230825, 0.0...
1 10.509150 105.265098 [0.09556685977230825, 0.09556685977230825, 0.0...
2 10.467721 105.192464 [0.09556685977230825, 0.09556685977230825]
3 10.494453 105.241281 [0.09556685977230825, 0.09556685977230825, 0.0...
4 10.535058 105.252744 [0.09556685977230825, 0.09556685977230825, 0.0...
In [32]:
def count_nan_df_col(df, column_name):
    count = 0
    for array in df[column_name]:
        count += np.isnan(array).sum()
    return count

print(f"Number of NaN values in Albedo DataFrame: {count_nan_df_col(al_df, 'Albedo')}")
print(f"Number of NaN values in NDVI DataFrame: {count_nan_df_col(ndvi_df, 'NDVI')}")
print(f"Number of NaN values in NDWI DataFrame: {count_nan_df_col(ndwi_df, 'NDWI')}")
print(f"Number of NaN values in SAVI DataFrame: {count_nan_df_col(savi_df, 'SAVI')}")
print(f"Number of NaN values in EVI 2 DataFrame: {count_nan_df_col(evi2_df, 'EVI 2')}")
print(f"Number of NaN values in Land Surface Temp DataFrame: {count_nan_df_col(lst_df, 'Land Surface Temperature')}")
print(f"Number of NaN values in Soil Moisture DataFrame: {count_nan_df_col(sm_df, 'Soil Moisture')}")
Number of NaN values in Albedo DataFrame: 0
Number of NaN values in NDVI DataFrame: 0
Number of NaN values in NDWI DataFrame: 0
Number of NaN values in SAVI DataFrame: 0
Number of NaN values in EVI 2 DataFrame: 0
Number of NaN values in Land Surface Temp DataFrame: 0
Number of NaN values in Soil Moisture DataFrame: 0

We now see that we do not have NaN values anymore. But each latitude and longitude combination has a list of indicator values. This intuitively makes sense, since we pulled in data from the satellites over a specific time window around the date of harvest. Since we chose a narrow time window, we can expect the values to be quite similar. So we will just take the mean of these index values for each latitude and longitude combination.

In [33]:
def find_mean(lst):
    return sum(lst) / len(lst)

al_df['Average Albedo'] = al_df['Albedo'].apply(find_mean)
evi2_df['Average EVI 2'] = evi2_df['EVI 2'].apply(find_mean)
ndvi_df['Average NDVI'] = ndvi_df['NDVI'].apply(find_mean)
savi_df['Average SAVI'] = savi_df['SAVI'].apply(find_mean)
ndwi_df['Average NDWI'] = ndwi_df['NDWI'].apply(find_mean)
lst_df['Average Land Surface Temperature'] = lst_df['Land Surface Temperature'].apply(find_mean)
sm_df['Average Soil Moisture'] = sm_df['Soil Moisture'].apply(find_mean)

al_df.head()
Out[33]:
Latitude Longitude Albedo Average Albedo
0 10.510542 105.248554 [0.09556685977230825, 0.09556685977230825, 0.0... 0.095567
1 10.509150 105.265098 [0.09556685977230825, 0.09556685977230825, 0.0... 0.095567
2 10.467721 105.192464 [0.09556685977230825, 0.09556685977230825] 0.095567
3 10.494453 105.241281 [0.09556685977230825, 0.09556685977230825, 0.0... 0.095567
4 10.535058 105.252744 [0.09556685977230825, 0.09556685977230825, 0.0... 0.095567
In [34]:
feature_df = pd.concat([al_df['Average Albedo'], evi2_df['Average EVI 2'], ndvi_df['Average NDVI'], savi_df['Average SAVI'], 
                        ndwi_df['Average NDWI'], sm_df['Average Soil Moisture'], lst_df['Average Land Surface Temperature']],axis=1)

feature_df.head()
Out[34]:
Average Albedo Average EVI 2 Average NDVI Average SAVI Average NDWI Average Soil Moisture Average Land Surface Temperature
0 0.095567 0.447983 0.635155 0.419686 -0.583988 -1.069717 66.657175
1 0.095567 0.447983 0.635155 0.419686 -0.583988 -1.105420 66.657175
2 0.095567 0.447983 0.635155 0.419686 -0.583988 -1.060978 66.657175
3 0.095567 0.447983 0.635155 0.419686 -0.583988 -1.017536 66.657175
4 0.095567 0.447983 0.635155 0.419686 -0.583988 -1.102196 66.657175

Abbas (2020) and Ray (2015) emphasize the importance of including precipitation as a predictor for rice yield variability. Therefore, we obtained precipitation data from a public website: Visual Crossing.

The code computes precipitation statistics (mean, maximum, minimum, and variance) for three districts in Vietnam based on historical weather data and various window sizes. These precipitation statistics are then combined with other indices for further analysis and modeling in predicting rice yield in the study area.

In [35]:
precip = weather_data['precip']
precip1 = weather_data1['precip']
precip2 = weather_data2['precip']
dates = locations['Date of Harvest']
unique_dates = dates.unique()

def sort_dates(date):
    day, month, year = date.split('-')
    return year, month, day

sorted_dates = sorted(unique_dates, key=sort_dates)

window_sizes = [168,170,171,172,174,175,176,177,178,182,183,184,185,186,187,188,190,191,193,194,
                197,201,203,278,283,285,286,287,288,290,292,293,294,295,296,301,302,304,307,308,312]

def calculate_precip_stats(precip, window_sizes):
    prec_mean = []
    prec_max = []
    prec_min = []
    prec_var = []

    for window_size in window_sizes:
        prec_mean.append(precip.iloc[:window_size].mean())
        prec_max.append(precip.iloc[:window_size].max())
        prec_min.append(precip.iloc[:window_size].min())
        prec_var.append(precip.iloc[:window_size].var())

    return prec_mean, prec_max, prec_min, prec_var

# Chau Phu
prec_df_cp, prec_max_df_cp, prec_min_df_cp, prec_var_df_cp = calculate_precip_stats(precip, window_sizes)

# Chau Thanh
prec_df_ct, prec_max_df_ct, prec_min_df_ct, prec_var_df_ct = calculate_precip_stats(precip1, window_sizes)

# Thoai Son
prec_df_ts, prec_max_df_ts, prec_min_df_ts, prec_var_df_ts = calculate_precip_stats(precip2, window_sizes)


# create a list of districts and precipitation data
districts = ['Chau_Phu'] * 41 + ['Chau_Thanh'] * 41 + ['Thoai_Son'] * 41
precipitation_data = [prec_df_cp, prec_df_ct, prec_df_ts]
max_precipitation_data = [prec_max_df_cp, prec_max_df_ct, prec_max_df_ts]
min_precipitation_data = [prec_min_df_cp, prec_min_df_ct, prec_min_df_ts]
var_precipitation_data = [prec_var_df_cp, prec_var_df_ct, prec_var_df_ts]
dates_extended = sorted_dates * 3

# flatten the precipitation data list
precipitation_data = [item for sublist in precipitation_data for item in sublist]
max_precipitation_data = [item for sublist in max_precipitation_data for item in sublist]
min_precipitation_data = [item for sublist in min_precipitation_data for item in sublist]
var_precipitation_data = [item for sublist in var_precipitation_data for item in sublist]

# create a dictionary with districts and precipitation data
prec = {'District': districts, 'Date of Harvest': dates_extended, 'Precipitation': precipitation_data, 'Precipitation Variance': var_precipitation_data}

# create a DataFrame from the dictionary
df = pd.DataFrame(prec)

new_df = pd.DataFrame({'Average Albedo': feature_df['Average Albedo'], 'Average EVI 2': feature_df['Average EVI 2'], 
                       'Average NDVI': feature_df['Average NDVI'] , 'Average SAVI': feature_df['Average SAVI'], 'Average NDWI': feature_df['Average NDWI'],
                      'Average Land Surface Temperature': feature_df['Average Land Surface Temperature'], 'Average Soil Moisture': feature_df['Average Soil Moisture']})
new_df['Date of Harvest'] = locations['Date of Harvest']
new_df['District'] = locations['District']

new_df = pd.merge(left=new_df, right=df, on=['District', 'Date of Harvest'], how='left')
new_df.drop(columns=['District', 'Date of Harvest'], inplace=True)
new_df.head()
Out[35]:
Average Albedo Average EVI 2 Average NDVI Average SAVI Average NDWI Average Land Surface Temperature Average Soil Moisture Precipitation Precipitation Variance
0 0.095567 0.447983 0.635155 0.419686 -0.583988 66.657175 -1.069717 6.304514 177.830676
1 0.095567 0.447983 0.635155 0.419686 -0.583988 66.657175 -1.105420 6.304514 177.830676
2 0.095567 0.447983 0.635155 0.419686 -0.583988 66.657175 -1.060978 6.304514 177.830676
3 0.095567 0.447983 0.635155 0.419686 -0.583988 66.657175 -1.017536 6.304514 177.830676
4 0.095567 0.447983 0.635155 0.419686 -0.583988 66.657175 -1.102196 6.322997 178.353735

Now let's start building the model ... ¶

Let's start off by predicting rice yield using a simple linear regression model. To hedge against overfitting and compensate for the lack of validation data, we split our dataset into a 75:25 train-test ratio instead of 80:20.

In [36]:
X = new_df
y = locations['Rice Yield (kg/ha)']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=123)

Linear Regression ¶

In [37]:
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

# Predict on the test set
y_pred_linear = linear_model.predict(X_test)
r2_linear = r2_score(y_test, y_pred_linear)
mse_linear = mean_squared_error(y_test, y_pred_linear)

print(f"Linear Regression Accuracy: {r2_linear:.2f}")
print(f"Linear Regression MSE: {mse_linear**(1/2):.2f}")
Linear Regression Accuracy: 0.68
Linear Regression MSE: 436.25

The fact that 0.86 is the benchmark accuracy, a score of 0.68 seems like a good start. We can introduce some regularization to our regression model and go with Lasso.

Lasso Regression ¶

In [38]:
lasso_model = Lasso(alpha=0.5)
lasso_model.fit(X_train, y_train)

y_pred_lasso = lasso_model.predict(X_test)
r2_lasso = r2_score(y_test, y_pred_lasso)
mse_lasso = mean_squared_error(y_test, y_pred_lasso)

print(f"Lasso Regression Accuracy: {r2_lasso:.2f}")
print(f"Lasso Regression MSE: {mse_lasso**(1/2):.2f}")
Lasso Regression Accuracy: 0.67
Lasso Regression MSE: 440.32

The accuracy didn't improve, nor did the MSE reduce. The added regularization is leading to a slightly worse performance compared to linear regression. The coefficient shrinkage caused by Lasso can be visualized below.

In [39]:
coeff_df = pd.DataFrame()
coeff_df['Feature'] = X_train.columns
coeff_df['Linear Coefficients'] = linear_model.coef_
coeff_df['Lasso Coefficients'] = lasso_model.coef_

coeff_df
Out[39]:
Feature Linear Coefficients Lasso Coefficients
0 Average Albedo -7207.746314 -0.000000
1 Average EVI 2 -1653.822764 -728.091388
2 Average NDVI 600.007497 446.673485
3 Average SAVI 2569.392649 -0.000000
4 Average NDWI 1666.820463 245.993988
5 Average Land Surface Temperature -1.849313 -3.391375
6 Average Soil Moisture 827.565980 219.788827
7 Precipitation -1243.472173 -1186.544723
8 Precipitation Variance 18.794090 18.538168

Let's move-on to employ advanced machine learning regression models to better capture the underlying patterns within our parameters.

Feature Extraction ¶

Before performing feature extraction, we decided to first visualize the correlations between our predictor variables to identify any underlying patterns among them.

In [40]:
Xy = pd.concat([X, y], axis=1)
corr_matrix = Xy.corr()

sns.heatmap(corr_matrix, annot=True, cmap='GnBu', fmt='.2f', vmin=-1, vmax=1, linewidths=0.5, cbar_kws={'shrink': 0.5})
plt.title("Fig. 9. Correlation Heatmap", fontweight='bold')
plt.show()

Recursive Feature Elimination (RFE) is an iterative technique for feature selection in machine learning. It trains a model on subsets of features, eliminating the least important ones until the desired count is reached. We're using RFE to identify the best features for the machine learning models we plan to test.

In [41]:
# Separate the features and target
def perform_rfe(estimator, X, y):
    with open(os.devnull, 'w') as devnull:
        with redirect_stdout(devnull):
            selector = RFECV(estimator, cv=5)
            selector.fit(X, y)
    return X.columns[selector.support_]

estimators = {
    "Random Forest Regressor": RandomForestRegressor(),
    "Gradient Boosting Regressor": GradientBoostingRegressor(),
    "XGB Regressor": XGBRegressor(),
    "CatBoost Regressor": CatBoostRegressor(logging_level='Silent')
}

selected_features = {}

for name, estimator in estimators.items():
    selected_features[name] = perform_rfe(estimator, X, y)
    print(f"Selected features for {name}:", selected_features[name])
Selected features for Random Forest Regressor: Index(['Average Albedo', 'Average NDWI', 'Average Land Surface Temperature',
       'Average Soil Moisture', 'Precipitation', 'Precipitation Variance'],
      dtype='object')
Selected features for Gradient Boosting Regressor: Index(['Average Albedo', 'Average NDWI', 'Average Land Surface Temperature',
       'Precipitation', 'Precipitation Variance'],
      dtype='object')
Selected features for XGB Regressor: Index(['Precipitation', 'Precipitation Variance'], dtype='object')
Selected features for CatBoost Regressor: Index(['Average Albedo', 'Average EVI 2', 'Average NDVI', 'Average NDWI',
       'Average Land Surface Temperature', 'Average Soil Moisture',
       'Precipitation', 'Precipitation Variance'],
      dtype='object')

We'll set up machine learning models tailored to our input parameters, experimenting with Random Forest, Gradient Boosting, XGBoost, and CatBoost. These four models effectively handle non-linear and complex relationships, as observed in the correlation heatmap, and provide robust regularization to prevent overfitting. Gradient Boosting combines multiple decision trees for better performance on complex datasets, while XGBoost offers an optimized, scalable version of Gradient Boosting. CatBoost, though commonly used for categorical data, is fast, efficient, and resistant to overfitting.

ML Regression Models ¶

In [42]:
def train_and_evaluate(model, model_name, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)

    print(f"{model_name} Accuracy: {r2:.2f}")
    print(f"{model_name} MSE: {mse**(1/2):.2f}")
    print()
    return model, y_pred

# Create model instances
rf_model = RandomForestRegressor(n_estimators=15, max_depth=5, random_state=42)
gb_model = GradientBoostingRegressor(n_estimators=14, learning_rate=0.21, random_state=42)
xgb_model = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=18, learning_rate=0.201, seed=42)
cb_model = CatBoostRegressor(learning_rate=0.21, depth=5, n_estimators=25, verbose=0, random_state=42)

selection1 = ['Average Albedo', 'Average NDWI', 'Average Land Surface Temperature', 'Precipitation', 'Precipitation Variance']
selection2 = ['Average NDWI', 'Average Land Surface Temperature', 'Precipitation', 'Precipitation Variance']
selection3 = ['Precipitation', 'Precipitation Variance']
selection4 = ['Average Albedo', 'Average EVI 2', 'Average NDVI', 'Average NDWI', 'Average Land Surface Temperature', 'Average Soil Moisture', 'Precipitation', 'Precipitation Variance']

X1 = new_df[selection1]
X2 = new_df[selection2]
X3 = new_df[selection3]
X4 = new_df[selection4]
y = locations['Rice Yield (kg/ha)']

X_train1, X_test1, y_train, y_test = train_test_split(X1, y, test_size=0.25, random_state=123)
X_train2, X_test2, y_train, y_test = train_test_split(X2, y, test_size=0.25, random_state=123)
X_train3, X_test3, y_train, y_test = train_test_split(X3, y, test_size=0.25, random_state=123)
X_train4, X_test4, y_train, y_test = train_test_split(X4, y, test_size=0.25, random_state=123)

# Train and evaluate models
rf_model, y_pred_rf = train_and_evaluate(rf_model, "Random Forest Regression", X_train1, y_train, X_test1, y_test)
gb_model, y_pred_gb = train_and_evaluate(gb_model, "Gradient Boosting Regression", X_train2, y_train, X_test2, y_test)
xgb_model, y_pred_xgb = train_and_evaluate(xgb_model, "XGBoost Regression", X_train3, y_train, X_test3, y_test)
cb_model, y_pred_cb = train_and_evaluate(cb_model, "CatBoost Regression", X_train4, y_train, X_test4, y_test)
Random Forest Regression Accuracy: 0.68
Random Forest Regression MSE: 437.55

Gradient Boosting Regression Accuracy: 0.68
Gradient Boosting Regression MSE: 434.28

XGBoost Regression Accuracy: 0.67
XGBoost Regression MSE: 445.02

CatBoost Regression Accuracy: 0.64
CatBoost Regression MSE: 461.87

These metrics don't appear to differ significantly from the Accuracy and MSE of Linear and Lasso regression. However, these metrics are likely more reliable, as they offer increased robustness against overfitting. We can visualize the performance of these models using scatter plots to better understand their regression characteristics.

In [43]:
data = pd.DataFrame({'Actual': y_test,
                     'Model1': y_pred_rf,
                     'Model2': y_pred_gb,
                     'Model3': y_pred_xgb,
                     'Model4': y_pred_cb})

fig, axes = plt.subplots(2, 2, figsize=(9, 8))

sns.regplot(ax=axes[0, 0], data=data, x='Actual', y='Model1', scatter_kws={'alpha': 0.3}, color = "Black")
axes[0, 0].set_title('Fig. 10. Random Forest', fontweight='bold')

sns.regplot(ax=axes[0, 1], data=data, x='Actual', y='Model2', scatter_kws={'alpha': 0.3}, color = "Orange")
axes[0, 1].set_title('Fig. 11. Gradient Boosting', fontweight='bold')

sns.regplot(ax=axes[1, 0], data=data, x='Actual', y='Model3', scatter_kws={'alpha': 0.3}, color = "#121bc7")
axes[1, 0].set_title('Fig. 12. XGBoost', fontweight='bold')

sns.regplot(ax=axes[1, 1], data=data, x='Actual', y='Model4', scatter_kws={'alpha': 0.3}, color="Red")
axes[1, 1].set_title('Fig. 13. CatBoost', fontweight='bold')

plt.tight_layout()
plt.show()

The ML models have been hyperparameter-tuned to their optimal capacity. However, out of curiosity, we explored whether further optimization of individual models or an ensemble approach could yield improved results. After several iterations, we successfully enhanced our Gradient Boosting model. Initially, Gradient Boosting had provided the highest accuracy and lowest MSE among the tested models. By further boosting it with carefully selected parameters, we achieved our best results.

Boosting ¶

In [44]:
adaboost_regressor = AdaBoostRegressor(base_estimator=gb_model, n_estimators=90, learning_rate=0.025, random_state=42)
adaboost_regressor.fit(X_train2, y_train)

y_pred_adaboost = adaboost_regressor.predict(X_test2)
r2_adaboost = r2_score(y_test, y_pred_adaboost)
mse_adaboost = mean_squared_error(y_test, y_pred_adaboost)

print(f"Boosting Regression Accuracy: {r2_adaboost:.2f}")
print(f"Boosting Regression MSE: {mse_adaboost**(1/2):.2f}")
Boosting Regression Accuracy: 0.69
Boosting Regression MSE: 430.36
In [45]:
data2 = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred_adaboost})
plt.figure(figsize=(5, 4))
ax = sns.regplot(data=data2, x='Actual', y='Predicted', scatter_kws={'alpha': 0.3})
ax.set_title('Fig. 14. Boosting', fontweight='bold')
plt.show()

Our final model achieved an accuracy of 0.69 and an MSE of 430.36. While not reaching 0.86, we consider this a modest success, given the reliance on satellite data and a limited ground-truth dataset. However, the model isn't optimal. For instance, the histogram of residuals reveals an approximately normal distribution with slight skewness, but the peak isn't centered around zero. The residuals' mean is also not close to zero, indicating model bias. These observations suggest room for further improvement in our model.

In [46]:
data2['Residuals'] = data2['Actual'] - data2['Predicted']
plt.figure(figsize=(5, 4))
ax = sns.histplot(data=data2, x='Residuals', kde=True, color='Black')
ax.set_title('Fig. 15. Residuals Distribution', fontweight='bold')
plt.show()

Bottlenecks and Methods of Improvement ¶

  1. Insufficient Validation Data: To properly assess overfitting or underfitting, a validation set is needed to evaluate model performance on unseen data. Interestingly, we found a high R-squared for Linear Regression, which might be due to overfitting. The model could be capturing excessive noise without undergoing RFE and having too many parameters.

  2. Feature Engineering Improvements: As new data and research become available, there's always potential for enhancement in feature engineering.

  3. Exploring Alternative Regression Techniques: Although we experimented with several models, there may be others that outperform those tested. For instance, Neural Networks can effectively capture complex relationships in a dataset.

Conclusion ¶

This project was both challenging and intriguing to work on. Each section required extensive research, whether it was determining how to visualize the Cloud Masking algorithm, examining RGB images from time series data, or identifying the most suitable method to address missing values (Cubic Spline Interpolation). Overall, this was an excellent learning experience and a genuinely enjoyable endeavor.

References ¶

  1. United Nations. (n.d.). https://unstats.un.org/sdgs/report/2022/
  2. Seungtaek Jeong et al. (2021). Sci. Total Environ. https://www.sciencedirect.com/science/article/pii/S0048969721048014
  3. Xu et al. (2021). Remote Sens. 13, 3994. https://doi.org/10.3390/rs13193994
  4. Miao et al. (2011) - Miao, Y., Mulla, D. J., & Robert, P. C. Precision Agriculture, 12(1), 34-50.
  5. Srivastava et al. (2012) - Srivastava, P. K., Han, D., Ramirez, M. A., & Bray, M. Environmental Monitoring and Assessment, 184(6), 3467-3482.
  6. Wu et al. (2015) - Wu, M., Niu, Z., Tang, Q., Huang, W., Rivard, B., & Feng, J. Agricultural and Forest Meteorology, 200, 251-263.
  7. 9-11. Visual Crossing. Chau Phu, Chau Thanh, Thoai Son Weather History. https://www.visualcrossing.com/weather-history/
  8. Ray, D.K., et al. (2015). Climate variation explains a third of global crop yield variability. Nature Communications, 6, 5989. https://doi.org/10.1038/ncomms6989
  9. Abbas & Mayo. (2020). Environ. Dev. Sustain. 23(2), 1706–1728. https://doi.org/10.1007/s10668-020-00647-8
  10. Xiao et al. (2006). Remote Sens. Environ. 100(1), 95-113. https://www.sciencedirect.com/science/article/pii/S0034425705003433

Contributions ¶

  1. Divyadarshan Punjabi: Feature Engineering for all Predictor Variables, Feature Extraction, ML Modelling, Boosting, Markdown explanations
  2. Nirvaan Rohira: Cloud and Water Masking Visualisations for Sample Location, Linear and Lasso Regression, Heatmap Visualisation, Markdown explanations
  3. Yasin Zahir: Research work: Introductions, finding research papers, Markdown explanations, Setting up the Outline and References Section

In the end, we build upon each individual work together; formulating the bottlenecks and improvements needed and the conclusion for the project.